feat: add HTTP page language detection #160
Draft
h3zh1 wants to merge 1 commit into
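Background
Spray already saves the response header and body during scanning and reports information extracted from responses through extracts. This PR adds HTTP page language detection and writes the result into the existing extracts structure. It issues no additional network requests and does not extend the output schema.
Core design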
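- Reuse existing responses: parse the Header and Body that Baseline has already saved
- Clear output boundary: extracts
- Detection sources:
  - Content-Language header
  - <html lang> / xml:lang
  - og:locale
Changes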
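Added
- pkg/http_language.go: HTTP page language extraction logic
  - ExtractHTTPLanguage(): extracts language information from the header/body
  - HTTPLanguageExtract(): converts the result to parsers.Extracteds
- pkg/http_language_test.go: unit tests for language extraction
- core/baseline/http_language_test.go: baseline integration tests
  - extracts[name=language] is output after Collect()
  - --finger / finger engine
Modified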
- core/baseline/baseline.go
  - Appends the language extract in the Collect() flow
  - Stays consistent with the response information extraction pattern of ProtonExtract / CollectURL
Output example
In the full JSON result, the entry is appended to the extracts array:
    {
      "extracts": [
        {
          "name": "language",
          "severity": "info",
          "extract_result": ["en"]
        }
      ]
    }
Test plan
- git diff --check origin/master..HEAD
- go test ./pkg/... ./core/baseline/... -v
- go vet ./pkg ./core/baseline
- https://example.com outputs extracts[name=language]
Known issues
go test ./... currently still fails in the pre-existing TestE2E_CrawlHonorsMaxLength: crawl followed a link located beyond --max-length 1KB. This failure lies outside the scope of this change; this PR does not modify the crawl logic.